
    Using linguistic information to classify Portuguese text documents

    This paper examines the role of various linguistic structures in text classification, applying the study to the Portuguese language. Besides using a bag-of-words representation, where we evaluate different measures and use linguistic knowledge for term selection, we perform several experiments using syntactic information, representing documents as strings of words and as strings of syntactic parse trees. To build the classifier we use the Support Vector Machine (SVM) algorithm, which is known to produce good results on text classification tasks, and apply the study to a dataset of articles from the Público newspaper. The results show that the sentences' syntactic structure is not useful for text classification (as initially expected), but that part-of-speech information can be used as a term selection technique to construct the bag-of-words representation of documents.
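    A rough sketch of the bag-of-words plus SVM setup described above is given below with scikit-learn. The Público articles are not reproduced here, so the documents, labels and category names are invented placeholders, and TF-IDF stands in for whichever term-weighting measure the paper actually evaluated.

```python
# Minimal sketch of bag-of-words text classification with a linear SVM.
# `docs` and `labels` are placeholder data, not the Público corpus.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

docs = [
    "O governo aprovou o novo orçamento",      # politics
    "A equipa venceu o campeonato nacional",   # sport
    "O banco central subiu as taxas de juro",  # economy
]
labels = ["politica", "desporto", "economia"]

# Bag-of-words representation weighted with TF-IDF, classified by a linear SVM.
classifier = make_pipeline(TfidfVectorizer(), LinearSVC())
classifier.fit(docs, labels)

print(classifier.predict(["As taxas de juro voltaram a subir"]))
```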

    Analysing part-of-speech for Portuguese text classification

    This paper proposes and evaluates the use of linguistic information in the pre-processing phase of text classification. We present several experiments evaluating the selection of terms based on different measures and linguistic knowledge. To build the classifier we used Support Vector Machines (SVM), which are known to produce good results on text classification tasks. Our proposals were applied to two different datasets written in the Portuguese language: articles from a Brazilian newspaper (Folha de São Paulo) and juridical documents from the Portuguese Attorney General’s Office. The results show the relevance of part-of-speech information for the pre-processing phase of text classification, allowing for a strong reduction of the number of features needed for text classification.
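    A minimal sketch of part-of-speech based term selection follows, assuming spaCy's pt_core_news_sm model as the tagger (the paper does not specify spaCy); only tokens whose POS tag belongs to the chosen categories are kept before the bag-of-words representation is built.

```python
# Illustrative POS-based term selection, not the authors' exact pipeline.
# Assumes the spaCy model pt_core_news_sm is installed.
import spacy
from sklearn.feature_extraction.text import CountVectorizer

nlp = spacy.load("pt_core_news_sm")
KEEP = {"NOUN", "PROPN", "ADJ"}  # POS categories retained as candidate terms

def pos_filter(text: str) -> str:
    """Keep only the tokens whose universal POS tag is in KEEP."""
    return " ".join(tok.text.lower() for tok in nlp(text) if tok.pos_ in KEEP)

docs = ["O tribunal confirmou a decisão do juiz sobre o recurso apresentado"]
filtered = [pos_filter(d) for d in docs]

# The reduced token stream is then fed to an ordinary bag-of-words vectorizer.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(filtered)
print(vectorizer.get_feature_names_out())
```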

    The Senso Question Answering approach to Portuguese QA@CLEF-2007

    This article contains the working notes on the Universidade de Évora's participation in QA@CLEF2007 (http://www.clef-campaign.org/), based on the Senso question answering system and the Portuguese monolingual task.

    A Contribution to Improve the Children's Catalogue of the Public Library

    This paper presents a proposal whose main goal is to contribute to the improvement of the children’s catalogue of the Portuguese Public Library. The context of the communicational paradigm of the contemporary library is described as a foundation that justifies an evolution of the concept of catalogue and supports the emergence of the concept of a catalogue for children. The catalogues, collections, facilities and equipment infrastructures available to young audiences in 115 Portuguese public libraries are identified and characterized. The survey is then used as a starting point for our proposal, which suggests the implementation of an ontology that, by acting upon the catalogue's substructure, will reveal relations between objects and improve the dialogue space idealized for the children's catalogue of the Public Library.

    The impact of NLP techniques in the multilabel text classification problem

    Support Vector Machines have been used successfully to classify text documents into sets of concepts. However, linguistic information is typically not used in the classification process, or its use has not been fully evaluated. We apply and evaluate two basic linguistic procedures (stop-word removal and stemming/lemmatization) in the multilabel text classification problem. These procedures are applied to the Reuters dataset and to Portuguese juridical documents from the Supreme Courts and the Attorney General’s Office.
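    The two pre-processing steps can be sketched roughly as below in a multilabel setting, with NLTK's Portuguese stop-word list and Snowball stemmer standing in for the procedures evaluated in the paper; the documents and concept labels are toy placeholders, not the Reuters or juridical corpora.

```python
# Sketch of stop-word removal + stemming before multilabel SVM classification.
# Requires the NLTK stopwords corpus (nltk.download("stopwords")).
from nltk.corpus import stopwords
from nltk.stem.snowball import SnowballStemmer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.svm import LinearSVC

stemmer = SnowballStemmer("portuguese")
stop = set(stopwords.words("portuguese"))

def preprocess(text: str) -> str:
    # Drop stop words, then reduce the remaining tokens to their stems.
    return " ".join(stemmer.stem(t) for t in text.lower().split() if t not in stop)

docs = ["o arguido foi condenado pelo tribunal",
        "o ministério publicou o novo regulamento fiscal"]
labels = [["direito penal"], ["direito fiscal", "administração"]]

Y = MultiLabelBinarizer().fit_transform(labels)
X = TfidfVectorizer().fit_transform(preprocess(d) for d in docs)

# One binary SVM per concept handles the multilabel assignment.
model = OneVsRestClassifier(LinearSVC()).fit(X, Y)
```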

    Enhancing a Portuguese text classifier using part-of-speech tags

    Support Vector Machines have been applied to text classification with great success. In this paper, we apply and evaluate the impact of using part-of-speech tags (nouns, proper nouns, adjectives and verbs) as a feature selection procedure on a dataset written in European Portuguese – the Portuguese Attorney General’s Office documents. From the results, we can conclude that verbs alone don't have enough information to produce good learners. On the other hand, we obtain learners with equivalent performance and a reduced number of features (at least half) if we use specific part-of-speech tags instead of all words.
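    The effect of restricting the vocabulary to particular POS categories can be illustrated with the toy comparison below; spaCy's Portuguese model is an assumption standing in for the tagger used in the paper, and the single example sentence replaces the Attorney General's Office documents.

```python
# Toy comparison of the feature space obtained from different POS subsets.
# Assumes the spaCy model pt_core_news_sm is installed.
import spacy
from sklearn.feature_extraction.text import CountVectorizer

nlp = spacy.load("pt_core_news_sm")
docs = ["O procurador analisou cuidadosamente o parecer jurídico "
        "e decidiu arquivar o processo pendente"]

def features_for(pos_tags):
    """Vocabulary obtained when only tokens with the given POS tags are kept."""
    texts = [" ".join(t.text.lower() for t in nlp(d) if t.pos_ in pos_tags)
             for d in docs]
    return CountVectorizer().fit(texts).get_feature_names_out()

for subset in ({"VERB"}, {"NOUN", "PROPN"}, {"NOUN", "PROPN", "ADJ", "VERB"}):
    print(sorted(subset), "->", len(features_for(subset)), "features")
```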

    The Senso Question Answering System at QA@CLEF 2008

    This article contains the working notes on the Universidade de Évora's participation in QA@CLEF2008 (http://www.clef-campaign.org/), based on the Senso question answering system and the Portuguese monolingual task.

    SEMONTOQA: A Semantic Understanding-Based Ontological Framework for Factoid Question Answering

    This paper presents an outline of an Ontological and Semantic understanding-based model (SEMONTOQA) for an open-domain factoid Question Answering (QA) system. The outlined model analyses unstructured English natural language texts to a vast extent and represents the inherent contents in an ontological manner. The model locates and extracts useful information from the text for various question types and builds a semantically rich knowledge-base that is capable of answering different categories of factoid questions. The system model converts the unstructured texts into a minimalistic, labelled, directed graph that we call a Syntactic Sentence Graph (SSG). An Automatic Text Interpreter, using a set of pre-learnt Text Interpretation Subgraphs and patterns, tries to understand the contents of the SSG in a semantic way. The system proposes a new feature- and action-based Cognitive Entity-Relationship Network designed to extend the text understanding process to an in-depth level. Application of supervised learning allows the system to gradually grow its capability to understand the text in a more fruitful manner. The system incorporates an effective Text Inference Engine which takes the responsibility of inferring the text contents and isolating entities, their features, actions, objects, associated contexts and other properties required for answering questions. A similar understanding-based question processing module interprets the user’s need in a semantic way. An Ontological Mapping Module, with the help of a set of pre-defined strategies designed for different classes of questions, is able to perform a mapping between a question’s ontology and the set of ontologies stored in the background knowledge-base. Empirical verification is performed to show the usability of the proposed model. The results achieved show that this model can be used effectively as a semantic understanding-based alternative QA system.
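    As an illustration only, a labelled, directed graph over a sentence can be built along the lines sketched below; the paper's exact SSG construction is not reproduced, and spaCy's English dependency parse together with networkx are assumptions used purely for this sketch.

```python
# Rough sketch of a labelled, directed sentence graph in the spirit of the SSG.
# Assumes the spaCy model en_core_web_sm and networkx are installed.
import spacy
import networkx as nx

nlp = spacy.load("en_core_web_sm")
doc = nlp("Marie Curie discovered polonium in 1898.")

ssg = nx.DiGraph()
for token in doc:
    # One node per token, labelled with its surface form and POS tag.
    ssg.add_node(token.i, word=token.text, pos=token.pos_)
    if token.head.i != token.i:  # the root points to itself; skip that self-loop
        ssg.add_edge(token.head.i, token.i, label=token.dep_)

for head, dep, data in ssg.edges(data=True):
    print(f"{doc[head].text} -{data['label']}-> {doc[dep].text}")
```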

    Is linguistic information relevant for the classification of legal texts?

    Text classification is an important task in the legal domain. In fact, most legal information is stored as text in a quite unstructured format, and it is important to be able to automatically classify these texts into a predefined set of concepts. Support Vector Machines (SVM), a machine learning algorithm, have been shown to be good classifiers for text bases [Joachims, 2002]. In this paper, SVMs are applied to the classification of European Portuguese legal texts – the Portuguese Attorney General’s Office Decisions – and the relevance of linguistic information in this domain, namely lemmatisation and part-of-speech tags, is evaluated. The obtained results show that some linguistic information (namely lemmatisation and part-of-speech tags) can be successfully used to improve the classification results and, simultaneously, to decrease the number of features needed by the learning algorithm.

    DI@UE in CLEF2012: question answering approach to the multiple choice QA4MRE challenge

    In the 2012 edition of CLEF, the DI@UE team signed up for the Question Answering for Machine Reading Evaluation (QA4MRE) main task. For each question, our system tries to guess which of the five hypotheses is the most plausible response, taking into account the reading test content and the documents from the background collection on the question topic. For each question, the system applies Named Entity Recognition, Question Classification, and Document and Passage Retrieval. The criterion used in the first run is to choose the answer with the smallest distance between question and answer key elements. The system applies a specific treatment to certain factual questions, with the categories Quantity, When, Where, What, and Who, whose responses are usually short and likely to be detected in the text. For the second run, the system tries to solve each question according to its category. Textual patterns used for answer validation and Web answer projection are defined according to the question category. The system answered all 160 questions, finding 50 correct candidate answers.
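    The first-run criterion (picking the hypothesis whose key elements lie closest to the question's key elements in the supporting text) can be sketched as below; the passage, question terms and candidate answers are invented placeholders, not QA4MRE data.

```python
# Toy sketch of a smallest-distance answer selection criterion.
# All data below are placeholders, not the QA4MRE collection.
def key_positions(passage_tokens, key_terms):
    """Token positions at which any of the key terms occur."""
    return [i for i, tok in enumerate(passage_tokens) if tok in key_terms]

def min_distance(passage_tokens, question_terms, answer_terms):
    """Smallest token distance between a question term and an answer term."""
    q_pos = key_positions(passage_tokens, question_terms)
    a_pos = key_positions(passage_tokens, answer_terms)
    if not q_pos or not a_pos:
        return float("inf")  # the answer's key elements never appear in the passage
    return min(abs(q - a) for q in q_pos for a in a_pos)

passage = "the treaty was signed in lisbon in 2007 after long negotiations".split()
question_terms = {"treaty", "signed"}
candidates = {"Lisbon": {"lisbon"}, "Brussels": {"brussels"}, "Madrid": {"madrid"}}

best = min(candidates,
           key=lambda c: min_distance(passage, question_terms, candidates[c]))
print("chosen answer:", best)
```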